In this workshop, we will walk you through the data analysis of a lipidomics dataset, from data preprocessing and quality control to statistical analyses and interpretation. We will be using the data and analysis published by Tan et al. on ‘Variability of the Plasma Lipidome and Subclinical Coronary Atherosclerosis’ (Tan et al. 2022) as an example. The plasma lipidomics data from this study were obtained using liquid chromatography-tandem mass spectrometry (LC-MS/MS) with multiple reaction monitoring (MRM).
In this first part, we will inspect and process the targeted mass spectrometry (MS)-based plasma lipidomics raw dataset, starting from peak areas, moving on to data quality control and ending with a table of QC-filtered lipid concentration values. Careful inspection and quality control of the lipidomics raw data are essential for obtaining a high-quality dataset. Moreover, the data processing itself can be a source of errors, noise and artefacts. We therefore follow an automated, but supervised, preprocessing workflow in which we check each preprocessing step.
Figure 1: Data processing and QC workflow. Processing steps are indicated by blue boxes, quality-control steps by red boxes. Dotted arrows indicate re-running of previous steps. (B. Burla, unpublished)
We first load packages used in this part of the workshop. We will
employ several packages from the tidyverse which can be
loaded using library(tidyverse). The here package provides the
function here() that returns the root of the project. broom provides
functions to convert outputs of standard R functions such as
t.test and lm into tidy tables
(dataframes/tibbles). ggpmisc
extends ggplot2 with functions useful for plotting models
and annotations of axes. rgoslin
provides functions to normalize lipid names and to return structural
details from the names.
rm(list=ls()) # Clear all variables from the workspace
knitr::opts_chunk$set(collapse = TRUE, echo = TRUE, message = TRUE, warning = FALSE, fig.height=4, fig.width=10, fig.retina=2, comment = "#>", fig.show = "hold")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package to force all conflicts to become errors
library(here)
## here() starts at /Users/lsibjb/Code/Rsessions/ISLS11
library(broom)
library(ggpmisc)
## Loading required package: ggpp
##
## Attaching package: 'ggpp'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(rgoslin)
here::i_am("Part_1/Part1.rmd")
## here() starts at /Users/lsibjb/Code/Rsessions/ISLS11
We start with loading the table with peak areas, which were obtained after peak integration of LC-MS raw data using MRMkit (Teo et al., n.d.).
It is always good to check whether the data were imported
correctly, e.g. by inspecting the column types. For example, undefined text
indicating missing values (e.g. ND) within numerical columns
can lead read_csv() to assign a column the type text.
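A minimal sketch of this pitfall, using a made-up two-row CSV string (not part of the workshop data):

```r
library(readr)

csv_txt <- "SAMPLE,AREA\nS1,123.4\nS2,ND\n"

# "ND" is not among the default na strings, so AREA is imported as text
d_bad <- read_csv(I(csv_txt), show_col_types = FALSE)

# declaring "ND" as a missing-value string yields a numeric column with an NA
d_good <- read_csv(I(csv_txt), na = c("NA", "ND"), show_col_types = FALSE)
```

Comparing class(d_bad$AREA) with class(d_good$AREA) makes the effect of the na argument immediately visible.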
d_orig <- readr::read_csv(file = here("Part_1/data/SPERFECT_SLINGpanel_MRMkit_RawAreas_clean.csv"),col_names = TRUE, trim_ws = TRUE, na = c("NA", "ND", "n.d."))
#> Rows: 519 Columns: 416
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): FILENAME, BATCH, QC_TYPE
#> dbl (413): CE 14:0, CE 15:0, CE 16:0, CE 16:1, CE 16:2, CE 17:0, CE 17:1, CE...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_orig
#> # A tibble: 519 × 416
#> FILEN…¹ BATCH QC_TYPE CE 14…² CE 15…³ CE 16…⁴ CE 16…⁵ CE 16…⁶ CE 17…⁷ CE 17…⁸
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 SBLK-1… Long… SBLK 181. 209. 1806. 584. 56.4 169. 276.
#> 2 PBLK-1… Long… PBLK 47.3 54.6 440. 143. 0.568 19.4 96.9
#> 3 UBLK-1… Long… UBLK 63.1 99.0 354. 122. 38.6 37.9 28.3
#> 4 RQC-1-… Long… RQC 87.3 262. 23404. 3271. 248. 390. 518.
#> 5 RQC-1-… Long… RQC 210. 530. 37327. 4811. 226. 1212. 1451.
#> 6 RQC-1-… Long… RQC 335. 186. 52478. 4923. 307. 1133. 948.
#> 7 RQC-1-… Long… RQC 189. 239. 66109. 5774. 417. 1459. 1123.
#> 8 RQC-1-… Long… RQC 592. 230. 75214. 6100. 256. 1852. 803.
#> 9 RQC-1-… Long… RQC 302. 173. 49464. 6516. 253. 1502. 1483.
#> 10 LT_bat… Long… TQC 168. 370. 46518. 7232. 237. 1380. 1980.
#> # … with 509 more rows, 406 more variables: `CE 18:0` <dbl>, `CE 18:1` <dbl>,
#> # `CE 18:1 d7 (ISTD)` <dbl>, `CE 18:2` <dbl>, `CE 18:3` <dbl>,
#> # `CE 20:1` <dbl>, `CE 20:2` <dbl>, `CE 20:3` <dbl>, `CE 20:4` <dbl>,
#> # `CE 20:5` <dbl>, `CE 22:0` <dbl>, `CE 22:1` <dbl>, `CE 22:4` <dbl>,
#> # `CE 22:5` <dbl>, `CE 22:6` <dbl>, `CE 24:0` <dbl>, `CE 24:1` <dbl>,
#> # `CE 24:4` <dbl>, `CE 24:5` <dbl>, `CE 24:6` <dbl>, `Cer d18:0/16:0` <dbl>,
#> # `Cer d18:0/18:0` <dbl>, `Cer d18:0/20:0` <dbl>, `Cer d18:0/22:0` <dbl>, …
First we clean the imported dataset for easier handling later on.
Inconsistent sample and lipid names can be problematic as well and may
need to be fixed. In our case, we just clean the sample names by
removing the .mzML suffix. Since the file/sample names do not contain
any information on the analysis order, we furthermore add the run-order
number RUN_ID as the first column, knowing that the imported
data is in analysis-sequence order.
In the second step, we convert the data into the long (narrow) format. In the long format, every observation (= every lipid/sample pair) is a row, and columns represent the measured variables (e.g. peak area, RT) for each observation. Especially in this phase of the data analysis, the long format is helpful.
Figure 2: Wide vs long-format tables. Columns in blue describe the samples, and in grey peak areas. In red is an additional variable not present in the wide table.
We can convert the format easily in R using the tidyr pivot
functions.
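As a toy illustration of the reshape with made-up values (two samples, two lipids):

```r
library(tidyr)

d_wide <- data.frame(
  FILENAME  = c("S1", "S2"),
  `CE 14:0` = c(181, 47),
  `CE 15:0` = c(209, 55),
  check.names = FALSE
)

# one row per sample/lipid pair, with the peak area as the measured variable
d_long_toy <- pivot_longer(d_wide, cols = -FILENAME,
                           names_to = "LIPID", values_to = "AREA")
d_long_toy
#> a 4 x 3 tibble with the columns FILENAME, LIPID and AREA
```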
d_orig <- d_orig |>
mutate(FILENAME = stringr::str_replace(FILENAME, "\\.mzML", "")) |> # escape "." to match the dot literally
mutate(RUN_ID = row_number(), .before = 1)
d_long <- d_orig |>
pivot_longer(names_to = "LIPID", values_to = "AREA", cols = -RUN_ID:-QC_TYPE) %>%
arrange(LIPID)
d_long
#> # A tibble: 214,347 × 6
#> RUN_ID FILENAME BATCH QC_TYPE LIPID AREA
#> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 SBLK-1 Longit_1 SBLK CE 14:0 181.
#> 2 2 PBLK-1 Longit_1 PBLK CE 14:0 47.3
#> 3 3 UBLK-1 Longit_1 UBLK CE 14:0 63.1
#> 4 4 RQC-1-10 Longit_1 RQC CE 14:0 87.3
#> 5 5 RQC-1-20 Longit_1 RQC CE 14:0 210.
#> 6 6 RQC-1-40 Longit_1 RQC CE 14:0 335.
#> 7 7 RQC-1-60 Longit_1 RQC CE 14:0 189.
#> 8 8 RQC-1-80 Longit_1 RQC CE 14:0 592.
#> 9 9 RQC-1-100 Longit_1 RQC CE 14:0 302.
#> 10 10 LT_batch1_TQC01 Longit_1 TQC CE 14:0 168.
#> # … with 214,337 more rows
Now we are ready to have a first look at how the analysis went. For this, we inspect the peak areas of the internal standards (ISTDs) over the analysis sequence by plotting them. Different QC samples were included in this analysis, each serving a different function (see also Broadhurst et al. 2018).
# Filter for ISTDs only
d_istd <- d_long %>% filter(str_detect(LIPID, "ISTD"))
#d_plot <- d_long %>% filter(str_detect(LIPID, "ISTD") & str_detect(LIPID, "Cer"))
# Convert QC_TYPE to a factor and sort, to ensure correct layering in plot
d_istd$QC_TYPE <- factor(d_istd$QC_TYPE, c("SAMPLE", "BQC", "TQC", "PBLK", "SBLK", "UBLK", "RQC"))
d_istd <- d_istd |> arrange(QC_TYPE)
# Define colors and shapes for each QC_TYPE
qc_colors <- c(SAMPLE = "grey50", BQC = "red", TQC = "blue",
PBLK = "green", SBLK = "darkgreen", UBLK = "green2", RQC = "pink3")
qc_shapes <- c(SAMPLE = 1, BQC = 21, TQC = 21,
PBLK = 21, SBLK = 23, UBLK = 1, RQC = 6)
# Plot
p <- ggplot(d_istd, aes(x=RUN_ID, y=AREA)) +
geom_point(aes(colour = QC_TYPE, fill = QC_TYPE, shape = QC_TYPE),
size = 1, alpha =0.7, stroke = 0.3) +
facet_wrap(vars(LIPID), ncol = 5, nrow = 4, scales="free_y") +
scale_shape_manual(na.value = NA, values = qc_shapes) +
scale_fill_manual(values = qc_colors, na.value = NA) +
scale_colour_manual(values = qc_colors, na.value = NA) +
scale_x_continuous(breaks = seq(0, max(d_istd$RUN_ID), by = 100 )) +
scale_y_continuous(limits = c(0, NA)) +
theme_bw(base_size = 8)
ggsave(plot = p, filename = here("Part_1/output/runscatter_ISTD.pdf"),
width = 280, height = 180, units = "mm")
plot(p)
The injected sample amount needs to be carefully chosen when measuring analytes covering a large concentration range. It is a trade-off between sensitivity and not exceeding the linear range of the measurement, among other factors. While protocols define an optimal injected sample amount (volume), the linear range of the LC-MS system can change, even within a longer analysis sequence. We therefore always check the linear response as a QC, using dilution, or injection-volume, series of a pooled QC extract.
Let’s plot the response curves of the ISTDs measured at the beginning and at the end of this run. For this, we extract the curve number and the relative sample amount from the sample names.
d_rqc <- d_long |>
filter(QC_TYPE == "RQC") |>
separate(col = FILENAME,
into = c("TYPE","CURVE_NO","AMOUNT"),
sep = "-",
remove = FALSE, convert = TRUE)
d_rqc$CURVE_NO <- factor(d_rqc$CURVE_NO)
d_rqc$AMOUNT <- as.numeric(d_rqc$AMOUNT)
p <- ggplot(d_rqc |> filter(str_detect(LIPID, "ISTD")),
aes(x=AMOUNT, y=AREA, color = CURVE_NO, group = CURVE_NO)) +
geom_point(size = 2, alpha =0.7, stroke = 0.3) +
facet_wrap(vars(LIPID), ncol = 5, nrow = 4, scales="free_y") +
ggpmisc::stat_poly_line(linewidth = 0.5, se = FALSE) +
ggpmisc::stat_poly_eq(aes(label = after_stat(rr.label)),
size = 2.4,
lineheight = 1, ) +
scale_colour_manual(values = c("1" = "cyan4", "2" ="blue3")) +
scale_x_continuous(limits = c(0, NA)) +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Rel. Sample Amount (%)") +
theme_bw(base_size = 8)
ggsave(plot = p, filename = here("Part_1/output/reponse_curves.pdf"),
width = 240, height = 160, units = "mm")
plot(p)
Peak integration of large panels covering many (hundreds of) lipid species is challenging. It is therefore always good to check the identifications in the final raw dataset. Our data are from a reversed-phase (RP) LC, in which the retention time (RT) increases with increasing carbon chain length and increasing saturation. We can therefore plot RT against chain length and saturation and possibly spot potential misannotations.
We import the ‘peak info’ table obtained during peak integration with
MRMkit. This data file also contains the precursor and product ion m/z
values. To obtain the chain length and saturation from each lipid
species, we use the rgoslin
package. rgoslin furthermore converts lipid names of
different ‘dialects’ to a normalized standard name. However,
rgoslin is not able to understand all ‘in-house’ dialects,
so we provide it with cleaned lipid names, which are also found in the
‘peak info’ table.
d_lipids <- readr::read_csv(file = here("Part_1/data/SPERFECT_SLINGpanel_MRMkit_peakQC.csv"), show_col_types = FALSE)
# We get the class names and remove the isotope info in []
d_lipids <- d_lipids |>
mutate(LIPID_tmp = str_replace(LIPID, " O\\-", "-O "),
LIPID_tmp = str_replace(LIPID_tmp, " P\\-", "-P "), .after = LIPID) |>
separate(LIPID_tmp, into = c("CLASS", "CHAINS", "OTHER"),
sep = " ", remove = TRUE, extra = "drop") |>
mutate(LIPID_NAME_CLEAN = str_remove(LIPID_NAME_CLEAN, "\\[.*"))
# rgoslin needs a vector with lipid names
d_goslin <- rgoslin::parseLipidNames(unique(d_lipids$LIPID_NAME_CLEAN))
d_goslin_sel <- d_goslin |>
select(Original.Name, Normalized.Name, Lipid.Maps.Main.Class, Total.C, Total.DB)
d_lipid_info <- d_lipids |> left_join(d_goslin_sel, by=c("LIPID_NAME_CLEAN"="Original.Name"))
And now let’s plot the data. Note that in this example we plot all species, before any QC filtering has been applied. Depending on your workflow, this step may be better placed after the QC filtering.
d_lipid_info$Total.DB <- factor(d_lipid_info$Total.DB)
p <- ggplot(d_lipid_info |> drop_na(Total.C),
aes(x=Total.C, y=RT, color = Total.DB, group = Total.DB)) +
geom_point(size = 2, alpha =0.75, stroke = 0.3) +
facet_wrap(vars(CLASS), ncol = 6, nrow = 6, scales="free") +
ggpmisc::stat_poly_line(method = "lm", linewidth = 0.5, se = FALSE, na.rm = TRUE) +
ggpmisc::stat_poly_eq(aes(label = after_stat(rr.label)),
size = 2.4,
lineheight = 1, na.rm = TRUE ) +
labs(x = "Total C atoms") +
theme_bw(base_size = 8)
ggsave(plot = p, filename = here("Part_1/output/chain_vs_rt.pdf"),
width = 260, height = 160, units = "mm")
plot(p)
We now decide to start processing the data. We will use the ISTDs to normalize the peak areas and to calculate (“relative”) concentrations, based on the known amount of the spiked-in ISTD and the sample amount.
\[ [Lipid_i] = \biggl(\frac{Area_{Lipid_i}}{Area_{ISTD_{class}}}\biggr) \cdot RF \cdot [ISTD] \cdot \biggl(\frac{V_{ISTD}}{V_{sample}}\biggr) \]
In the code below, we first import a table that maps each measured lipid species to the ISTD it should be normalized with. Additionally, we import a table that defines the concentrations of the ISTDs in the spike-in solution. Furthermore, we know that 4.5 µL of ISTD solution was spiked into 10 µL of plasma. We first join the tables together as shown in Figure 3, then group the table by ISTD, divide the areas of all lipids in a group by the area of the ISTD, and then calculate the concentrations row-wise.
Figure 3: Joining of mapped ISTDs and ISTD concentrations
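As a quick numeric check of the concentration formula above (all values are made up; the response factor is assumed to be 1):

```r
area_lipid   <- 50000  # peak area of the target lipid
area_istd    <- 25000  # peak area of its class-specific ISTD
rf           <- 1      # response factor (assumed 1 for this sketch)
istd_conc_nM <- 1000   # ISTD concentration in the spike-in solution (nM)
v_istd       <- 4.5    # µL of ISTD solution spiked in
v_sample     <- 10     # µL of plasma

# area ratio x RF x ISTD concentration (nM -> µM) x volume ratio
conc_uM <- (area_lipid / area_istd) * rf * (istd_conc_nM / 1000) * (v_istd / v_sample)
conc_uM
#> [1] 0.9
```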
d_istd_map <- readr::read_csv(file = here("Part_1/data/ISTD_mapping.csv"),
col_names = TRUE, trim_ws = TRUE, col_types = "c")
d_istd_conc <- readr::read_csv(file = here("Part_1/data/ISTD_conc.csv"),
col_names = TRUE, trim_ws = TRUE, col_types = "c")
d_processed <- d_long |>
left_join(d_istd_map, by = c("LIPID")) |>
left_join(d_istd_conc, by = c("ISTD")) |>
mutate(isISTD = (LIPID == ISTD)) |>
group_by(ISTD, FILENAME) |>
mutate(normAREA = AREA/AREA[isISTD],
CONC = normAREA * RF * ISTD_conc_nM / 1000 * 4.5 / 10) |>
ungroup()
Normalization with the class-specific ISTD often helps to remove systematic drifts and batch effects, but it may also introduce additional noise and artefacts. Let’s have a look at how the data look after the normalization.
Before, we plotted the ISTD runscatter on a single page; if we would like to look at all species, however, we could distribute the plots over several pages. There are different ways to achieve this. One possibility is facet_wrap_paginate() from the ggforce package, but this can be slow for large datasets. Here we use another, manual, approach: slicing the long table into pages that are then plotted.
Furthermore, since we are going to use this plot again later, we wrap it in a function, so that we can conveniently re-plot it.
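The page assignment itself is simple arithmetic; on a made-up set of 60 species with 5 × 5 panels per page:

```r
lipids       <- paste0("Lipid_", 1:60)  # made-up species names
rows_page    <- 5
columns_page <- 5

# same logic as ceiling(row_number() / (rows_page * columns_page))
page_no <- ceiling(seq_along(lipids) / (rows_page * columns_page))
table(page_no)
#> pages 1 and 2 hold 25 species each, page 3 the remaining 10
```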
# We define the function
runscatter <- function(data, var){
plot_page <- function(data, nrows, ncols){
ggplot(data, aes(x=RUN_ID, y={{var}})) +
geom_point(aes(colour = QC_TYPE, fill = QC_TYPE, shape = QC_TYPE),
size = 1, alpha =0.7, stroke = 0.3) +
facet_wrap(vars(LIPID), ncol = ncols, nrow = nrows, scales="free_y") +
geom_smooth(data= subset(data, QC_TYPE=="BQC"), aes(x=RUN_ID, y={{var}}),
method = "loess", span = 0.75, formula = y ~ x, se = FALSE,
na.rm = TRUE, color = "brown3", linewidth = .7)+
scale_shape_manual(na.value = NA, values = qc_shapes) +
scale_fill_manual(values = qc_colors, na.value = NA) +
scale_colour_manual(values = qc_colors, na.value = NA) +
scale_x_continuous(breaks = seq(0, max(d_istd$RUN_ID), by = 100 )) +
scale_y_continuous(limits = c(0, NA)) +
theme_bw(base_size = 8)
}
rows_page = 5
columns_page = 5
#get a table with page numbers for each lipid species
d_pages <- data |> select(LIPID) |> distinct() |>
mutate(page_no = ceiling(row_number() / (rows_page * columns_page)))
#plot each page from a nested table
d_plots <- data %>%
filter(!str_detect(QC_TYPE, "BLK|RQC"), !str_detect(LIPID, "ISTD")) |>
left_join(d_pages, by = "LIPID") %>%
group_by(page_no) |>
nest() |>
mutate(plt = map(data, ~ plot_page(., rows_page, columns_page)))
# Save pages to a PDF.
pdf(file = here(paste0("Part_1/output/runscatter_", quo_name(enquo(var)), ".pdf")),
onefile = TRUE, width = 280/25.4, height = 180/25.4)
#d_plots$plt
invisible(purrr::walk(d_plots$plt, print)) # use this to prevent printing of index
dev.off()
}
# and run it twice, plotting raw areas and concentrations
runscatter(d_processed, AREA)
#> quartz_off_screen
#> 2
runscatter(d_processed, CONC)
#> quartz_off_screen
#> 2
We observe drifts in some lipid species, even after normalization with the ISTDs. We will now try to correct these drifts. While drifts are often caused by gradual changes in instrument performance, batch effects can occur, e.g. when samples are extracted in batches or when the instrument needs to be stopped. In both cases, we assume here that the BQCs (batch/process QCs) represent the drift and batch effects also seen in the study samples. First, we apply a within-batch smoothing, and then we align the batches using median centering.
For the smoothing in the code example below, we will use loess on log-transformed data to make it more robust. The setting of the smoothing parameter, i.e. the loess span, depends on your data; we will choose the default of 0.75.
Note that for the analysis in the publication, another algorithm, built into MRMkit (Teo et al., n.d.), was used, which is based on all data points rather than just the QCs.
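To get a feel for the span parameter, here is a small simulation (fabricated drift, not the workshop data): a wide span follows the overall trend, while a narrow span also follows the noise.

```r
set.seed(1)
run_id <- 1:60
# simulated log2 signal with a slow downward drift plus noise
conc_log <- log2(1000 * exp(-run_id / 80)) + rnorm(60, sd = 0.1)
d_toy <- data.frame(RUN_ID = run_id, CONC_LOG = conc_log)

fit_wide   <- loess(CONC_LOG ~ RUN_ID, data = d_toy, span = 0.75)
fit_narrow <- loess(CONC_LOG ~ RUN_ID, data = d_toy, span = 0.25)

# predicted trend that would be subtracted during the correction
trend <- predict(fit_wide, newdata = data.frame(RUN_ID = run_id))
```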
d_processed_corr <- d_processed # make a new copy
get_loess <- function(d) {
tryCatch({
dt <- tibble::tibble(RUN_ID = seq(min(d$RUN_ID), max(d$RUN_ID), 1))
res <- stats::loess(
data = subset(d, QC_TYPE == "BQC"), formula = CONC_LOG ~ RUN_ID, span = 0.75) |>
stats::predict(dt) %>% as.numeric()
res},
error = function(e) {return(rep(NA_real_, length(d$RUN_ID)))})
}
# Within-batch smoothing
d_processed_corr$CONC_LOG <- log2(d_processed_corr$CONC)
d_processed_corr <- d_processed_corr |>
group_by(LIPID, BATCH) |>
nest() |>
mutate(Y_PREDICTED = purrr::map(data, \(x) get_loess(x))) |>
unnest(cols = c(data, Y_PREDICTED))
d_processed_corr <- d_processed_corr %>%
group_by(LIPID, BATCH) |>
mutate(Y_PREDICTED = Y_PREDICTED - median(Y_PREDICTED, na.rm = TRUE),
CONC_ADJ = 2^(CONC_LOG - Y_PREDICTED)) |> ungroup()
# Between-batch median-centering
d_processed_corr <- d_processed_corr |>
dplyr::group_by(LIPID, BATCH) |>
dplyr::mutate(CONC_ADJ = CONC_ADJ/median(CONC_ADJ[QC_TYPE == "BQC"], na.rm = TRUE)) |>
dplyr::ungroup()
d_processed_corr <- d_processed_corr |>
dplyr::group_by(LIPID) |>
dplyr::mutate(CONC_ADJ = CONC_ADJ * median(CONC_ADJ[QC_TYPE == "BQC"], na.rm = TRUE)) |>
dplyr::ungroup()
runscatter(d_processed_corr, CONC_ADJ)
#> quartz_off_screen
#> 2
To evaluate the quality of the analysis and to filter the data, we calculate different QC metrics for each lipid species. These include the analytical coefficient of variation (%CV) based on the BQCs, the signal-to-blank ratio, and the R² of the response curves.
A word of caution here: we apply the drift/batch correction to all species, regardless of whether they show drift/batch effects. Such a correction can also introduce variability and artefacts. Furthermore, since we use the BQCs themselves for smoothing and median centering, their %CVs will, as a consequence, be overly optimistic. Ideally, we would use a QC subset, or another QC set, to determine the %CV.
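To illustrate what these metrics capture, a toy calculation with made-up concentration values:

```r
rsd <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

bqc <- c(98, 102, 100, 101, 99)  # technical replicates (BQC): low spread
spl <- c(60, 140, 90, 180, 75)   # study samples: biological spread

cv_bqc  <- rsd(bqc) * 100        # analytical %CV, here about 1.6%
d_ratio <- sd(bqc) / sd(spl)     # << 1: technical noise is small vs biology
```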
## run the line below if you want to test QC filtering without drift/batch correction
#d_processed_corr <- d_processed |> mutate(CONC_ADJ = CONC) # !!!!!!! overwrites correction
rsd <- function(x) sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)
d_qc_1 <- d_processed_corr |>
group_by(LIPID) |>
summarise(
Area_SPL = median(AREA[QC_TYPE == "SAMPLE"], na.rm = TRUE),
SB_ratio = Area_SPL/median(AREA[QC_TYPE == "PBLK"], na.rm = TRUE),
Conc_SPL = median(CONC_ADJ[QC_TYPE == "SAMPLE"], na.rm = TRUE),
CV_TQC = rsd(CONC_ADJ[QC_TYPE == "TQC"]) * 100,
CV_BQC = rsd(CONC_ADJ[QC_TYPE == "BQC"]) * 100,
CV_SPL = rsd(CONC_ADJ[QC_TYPE == "SAMPLE"]) * 100,
D_ratio = sd(CONC_ADJ[QC_TYPE == "BQC"], na.rm = TRUE)/sd(CONC_ADJ[QC_TYPE == "SAMPLE"], na.rm = TRUE)) |> ungroup()
f <- function(x) broom::glance(lm(AREA ~ AMOUNT, data = x))
d_qc_2 <- d_rqc |>
group_by(LIPID, CURVE_NO) |>
nest() |>
mutate(res = purrr::map(data, f)) |>
unnest(res)
d_qc_2 <- d_qc_2 |>
select(LIPID, CURVE_NO, r.squared, p.value) |>
pivot_wider(names_from = CURVE_NO, values_from = c(r.squared, p.value))
d_qc <- d_lipids |>
left_join(d_qc_1, by = "LIPID") |>
left_join(d_qc_2, by = "LIPID")
write_csv(x = d_qc, file = here("Part_1/output/QC-summary.csv"))
Now we apply a QC-filtering step to the data using our chosen criteria, e.g. CV < 25%. You can play with these parameters and see how they affect the number of species that pass/fail this step.
For some lipid classes, e.g. DGs, each species was measured via 2 transitions. Which transition is used for quantification is indicated in the metadata file (peakQC), so we can filter for the quantifiers. Furthermore, we exclude the ISTDs from the final dataset.
d_qc <- d_qc |>
mutate(
QC_pass =
(CV_BQC < 25 | (CV_BQC < 50 & D_ratio < 0.5)) &
SB_ratio > 3 &
r.squared_1 > 0.8 &
QUANTIFIER &
!str_detect(LIPID, "ISTD"))
print(paste0("QC filtering: ", nrow(d_qc[d_qc$QC_pass, ]), " of ", nrow(d_qc), " species passed QC"))
#> [1] "QC filtering: 311 of 413 species passed QC"
So, a total of 311 of the 413 species passed QC. We now can (and should) have a look at how many species passed per lipid class, and whether some classes have many species that failed QC.
d_qc_summary <- d_qc |> filter(QUANTIFIER) |> dplyr::count(CLASS, QC_pass)
p <- ggplot(d_qc_summary,
aes(x = CLASS, y = n, fill = QC_pass, group = QC_pass)) +
geom_bar(position="stack", stat="identity") +
scale_fill_manual(values = c("FALSE" = "#d6cac9", "TRUE" = "green4")) +
scale_x_discrete(limits=rev)+
coord_flip() + theme_bw()
ggsave(plot = p, filename = here("Part_1/output/qc_summary.pdf"),
width = 260, height = 160, units = "mm")
As a last step, we use a PCA to see how our BQCs performed and whether there are any outliers that need to be investigated.
d_pca <- d_processed_corr |>
filter(QC_TYPE %in% c("BQC", "TQC", "SAMPLE"), !str_detect(LIPID, "IS")) |>
select(FILENAME, QC_TYPE, BATCH, LIPID, CONC_ADJ) |>
pivot_wider(names_from = "LIPID", values_from = CONC_ADJ) |>
drop_na() # drop rows with NAs (e.g. species that could not be smoothed)
# This is the outlier !
d_pca <- d_pca |> filter(FILENAME != "LT_batch6_51")
d_metadata <- d_pca |> select(FILENAME, QC_TYPE, BATCH) |> distinct()
m_raw <- d_pca |> select(-QC_TYPE, -BATCH) %>%
select(where(~!any(is.na(.)))) |>
column_to_rownames("FILENAME") |>
as.matrix()
m_raw <- log2(m_raw)
dim_x <- 1
dim_y <- 2
# get pca result with annotation
pca_res <- prcomp(m_raw, scale = TRUE, center = TRUE)
pca_annot <- pca_res |> broom::augment(d_metadata)
pca_contrib <- pca_res |> broom::tidy(matrix = "eigenvalues")
p <- ggplot(data = pca_annot, aes_string(paste0(".fittedPC", dim_x),
paste0(".fittedPC", dim_y),
color = "QC_TYPE",
fill = "QC_TYPE",
shape = "QC_TYPE",
label = "FILENAME"
)) +
ggplot2::geom_hline(yintercept = 0, size = 0.4, color = "grey80") +
ggplot2::geom_vline(xintercept = 0, size = 0.4, color = "grey80") +
ggplot2::stat_ellipse(geom = "polygon", level = 0.95,alpha = 0.1, size = 0.3) +
ggplot2::geom_point(size = 1)
p <- p +
ggplot2::scale_color_manual(values = qc_colors, drop=TRUE) +
ggplot2::scale_fill_manual(values = qc_colors, drop=TRUE)+
ggplot2::scale_shape_manual(values = qc_shapes, drop=TRUE)
p <- p +
ggplot2::theme_light(base_size = 10) +
ggplot2::xlab(glue::glue("PC{dim_x} ({round(pca_contrib[[dim_x,'percent']]*100,1)}%)"))+
ggplot2::ylab(glue::glue("PC{dim_y} ({round(pca_contrib[[dim_y,'percent']]*100,1)}%)"))+
ggplot2::theme(
panel.grid = ggplot2::element_line(size = 0.3, color = "grey95"),
panel.border = ggplot2::element_rect(size = 1, color = "grey70"),
aspect.ratio=1)
ggsave(plot = p, filename = here("Part_1/output/PCA_QC.pdf"),
width = 170, height = 170, units = "mm")
p
We now save the final, drift-corrected and QC-filtered concentration data as a wide CSV table, which will then be used in Part 2. Have fun! :)
# QC filter data
d_final <- d_processed_corr |>
filter(QC_TYPE == "SAMPLE", !str_detect(LIPID, "ISTD")) |>
right_join(d_qc[d_qc$QC_pass,"LIPID"], by = "LIPID")
d_final_wide <- d_final |>
pivot_wider(id_cols = c(FILENAME, QC_TYPE), names_from = "LIPID", values_from = "CONC_ADJ")
write_csv(d_final_wide, here("Part_1/output/qc_filtered_results.csv"))